The goal of this problem set is to develop some intuition about the impact of the number of nodes in the hidden layer of a neural network. We will use a few simulated examples, so that we have a clear understanding of the structure of the data being modeled, and will assess how the performance of the neural network model is affected by the structure in the data and by the setup of the network.
First of all, to compensate for the lack of coverage of this topic in ISLR, let’s go over a couple of simple examples. We start by simulating a simple two-class dataset in a 2D predictor space with an outcome representative of an interaction between the attributes. (Please note that for the problems you will be working on this week you will be asked below to simulate a dataset using a different model.)
# fix seed so that narrative always matches the plots:
set.seed(1234567890)
nObs <- 1000
ctrPos <- 2
xyTmp <- matrix(rnorm(4*nObs),ncol=2)
xyCtrsTmp <- matrix(sample(c(-1,1)*ctrPos,nObs*4,replace=TRUE),ncol=2)
xyTmp <- xyTmp + xyCtrsTmp
gTmp <- paste0("class",(1+sign(apply(xyCtrsTmp,1,prod)))/2)
plot(xyTmp,col=as.numeric(factor(gTmp)),pch=as.numeric(factor(gTmp)),xlab="X1",ylab="X2")
abline(h=0)
abline(v=0)
Symbol color and shape indicate class. This is a typical setup that will present a problem for any approach estimating a single linear decision boundary. We used similar simulated data for the random forest (week 10) problem set.
Simulate data with n=1000 observations and p=3 covariates – all random values from standard normal distribution (\(\mathcal{N}\left(0,1\right), \mu=0,\sigma=1\)). Create two category class variable assigning all observations within a cube centered at \((0,0,0)\) with the edge length of \(2.5\) (i.e. with vertices at \((1.25,1.25,1.25)\), \((1.25,1.25,-1.25)\), …, \((-1.25,-1.25,-1.25)\)) – to one class category and observations outside of this cube – to the second class category. Confirm that the simulated observations are nearly evenly split between the two classes.
Please note that this dataset is entirely different from the one used in the preface – you will need to write code simulating it on your own. Somewhat related 2D dataset was used as a motivational example at week 12 (SVM) lecture before introducing kernels and SVMs. However, the example in the lecture was in 2D (three-dimensional problem here) and had a circular (in 3D – spherical) boundary (here we work with a cube as a decision surface). Since you will be reusing this code in the following two problems it is probably best to turn this procedure into a function with appropriate parameters.
Ten points available for this problem are composed of accomplishing the following tasks:
Correct implementation of the function generating data as prescribed above (3 points)
#simulate n = 1000 obs, p = 3 covariates
library(dplyr) # for between()
set.seed(4123)
gendata = function(n, a = 1.25){ #a: half the edge length of the cube
#generate data points
x = rnorm(n)
y = rnorm(n)
z = rnorm(n)
#classify points: cl = 1 when any coordinate falls outside [-a, a], i.e. outside the cube
cl = numeric(n)
cl[!between(x, -a, a)] = 1
cl[!between(y, -a, a)] = 1
cl[!between(z, -a, a)] = 1
df = data.frame(x, y, z, cl)
return(df)
}
data = gendata(1000)
Check and demonstrate numerically that the resulting class assignment splits these observations, subject to sampling variability, evenly between these two groups (3 points)
class = as.factor(data$cl)
summary(class)
## 0 1
## 464 536
barplot(summary(class), names.arg = c('within', 'outside'), main = 'class distribution', ylab = 'n', col = 'pink')
Plot values of the resulting covariates projected at each pair of
the axes indicating classes to which observations belong with symbol
color and/or shape (you can use function pairs, for
example) (2 points)
library(scatterplot3d)
pairs(data[, c("x", "y", "z")], col = data$cl + 1, pch = data$cl + 1) #pairwise projections
scatterplot3d(data$x, data$y, data$z, color = data$cl + 1)
Reflect on the geometry of this problem by answering the following question: what is the smallest number of planes in 3D space that would completely enclose points from the “inner” class? Is this number equal to the number of cube faces or is it something smaller? Larger? Please note that “enclose” above does not mean “perfectly discriminate between the points assigned to two classes” (2 points)
Four planes, arranged as the faces of a tetrahedron, would be able to completely enclose the points from the inner class; this is smaller than the six faces of the cube.
The boundary for assigning observations to the inner class used above – \(\max_i |X_i| \leq 1.25,i=\{1,2,3\}\) – splits observations from standard normal in 3D nearly evenly between the two classes, but not quite exactly 50/50. Please derive mathematical expression for the value to be used instead of 1.25 in the expression above to contain exactly half of probability density within the inner cube, present its numerical value to the 9-th decimal place and demonstrate numerically that it results in closer to equal split of the simulated observation between the two classes than \(1.25\).
Please note that this is not a problem of devising an algorithm equally splitting a given dataset in two, but purely probabilistic/mathematical question – what threshold to use for a cube size above that will split such dataset exactly evenly on average / in the limit of infinitely large sample size (i.e. that contains precisely half of the probability density of standard normal distribution in 3D).
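One way to approach this (a sketch, not necessarily the intended derivation): since the three coordinates are independent standard normals, the probability of the inner cube \([-a,a]^3\) is \(\left(2\Phi(a)-1\right)^3\); setting this equal to \(1/2\) gives \(a = \Phi^{-1}\left(\frac{1 + 2^{-1/3}}{2}\right)\). The self-contained R code below (it does not rely on functions defined elsewhere in this document) computes this value and demonstrates the split numerically:

```r
# threshold a such that P(max|X_i| <= a) = (2*pnorm(a) - 1)^3 = 1/2
a = qnorm((1 + 2^(-1/3)) / 2)
print(a, digits = 10)
# sanity check: probability mass of the inner cube at this threshold
(2 * pnorm(a) - 1)^3 # equals 0.5 up to floating-point error
# numerical demonstration on a large simulated sample
set.seed(1)
n = 1e6
xyz = matrix(rnorm(3 * n), ncol = 3)
mean(apply(abs(xyz), 1, max) <= a)     # close to 0.5
mean(apply(abs(xyz), 1, max) <= 1.25)  # farther from 0.5 (about 0.49 in expectation)
```

Note that \(a\) is slightly larger than 1.25, consistent with the slight excess of "outside" observations seen above.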
For the dataset simulated above fit neural networks with 1 through 6
nodes in a single hidden layer (use neuralnet
implementation). For each of them calculate training error (see an
example in Preface where it was calculated using err.fct
field in the result returned by neuralnet). Simulate
another independent dataset (with n=10,000 observations to make
resulting test error estimates less variable) using the same procedure
as above (3D standard normal, two classes, cube as a decision surface)
and use it to calculate the test error at each number of hidden nodes.
Plot training and test errors as function of the number of nodes in the
hidden layer. What does resulting plot tells you about the interplay
between model error, model complexity and problem geometry? What is the
geometrical interpretation of this error behavior?
library(neuralnet)
library(caret) # for confusionMatrix()
set.seed(999)
test = gendata(10000) #generate new independent test data
train = data
neurn = function(train, test, plot = T, v = 0, ...){
#store error results
trainsse = numeric()
testerr = numeric()
trainerr = numeric()
for (h in 1:6){
#fit neural net on all covariates present in the data (so any null variables added later enter the formula)
fml = as.formula(paste("cl ~", paste(setdiff(colnames(train), "cl"), collapse = " + ")))
nn = neuralnet(fml, data = train, hidden = h, err.fct = 'sse', threshold = 0.1)
#training SSE error
trainsse[h] = nn$result.matrix[1]
#predict test & training
pred = predict(nn, test)
predtr = predict(nn, train)
#error results test & train
conf = confusionMatrix(data = as.factor(as.numeric(pred > 0.5)), reference = as.factor(test$cl))
testerr[h] = (1 - conf$overall[1]) * 100 #1 - accuracy as percentage
conftr = confusionMatrix(data = as.factor(as.numeric(predtr > 0.5)), reference = as.factor(train$cl))
trainerr[h] = (1 - conftr$overall[1]) * 100
}
#plot error results
if (plot == TRUE){ #optional plotting
par(mfrow = c(1, 3))
plot(trainsse, xlab = 'nodes', ylab = 'SSE', main = 'Training Error (SSE)', type = 'b')
plot(testerr, xlab = 'nodes', ylab = '1 - Accuracy (%)', main = 'Testing Error (%)', type = 'b', ylim = c(0, 100))
plot(trainerr, xlab = 'nodes', ylab = '1 - Accuracy (%)', main = 'Training Error (%)', type = 'b', ylim = c(0, 100))
title(paste("Test nObs:", nrow(test), " Train nObs:", nrow(train), " Null Variables:", v), outer = T, line = -33)
}
invisible(data.frame(testerr, trainerr)) #return error but suppress output
}
neurn(train, test)
As model complexity increases, testing and training error decrease, up to a point. With one node the error is high, and it decreases as more nodes are added. Testing error is slightly higher than training error. Error decreases as the number of nodes grows to 4, where the misclassification error (1 - accuracy from the confusion matrix) reaches its minimum. Beyond 4 nodes, testing and training misclassification error both increase slightly, while the training SSE reported by neuralnet continues to decrease at 5 and 6 nodes. Geometrically, this suggests that 4 planes/decision boundaries are enough to classify the data effectively, and that using 6 does not meaningfully improve the results.
Setup a simulation repeating procedure described above for n=100, 200
and 500 observations in the training set as well as adding
none, 1, 2 and 5 null variables to the training and test data (and to
the covariates in formula provided to neuralnet). Draw
values for null variables from standard normal distribution as well and
do not use them in the assignment of the observations to the class
category
(e.g. x<-matrix(rnorm(600),ncol=6); cl<-as.numeric(factor(rowSums(abs(x[,1:3])<1.25)==3))
creates dataset with three informative and three null variables). Repeat
calculation of training and test errors at least several times for each
combination of sample size, number of null variables and size of the
hidden layer simulating new training and test dataset every time to
assess variability in those estimates. Present resulting error rates so
that the effects of sample size and fraction of null variables can be
discerned and discuss their impact of the resulting model fits.
#function to add null variables
nulldata = function(null, nobs){
d = gendata(nobs) #generate and classify informative variables
if (null > 0){ #skip if not adding null variables
for (v in 1:null){
d[[paste0("null", v)]] = rnorm(nobs) #null covariate: noise, never used in the class assignment
}
}
return(d)
}
#nobs = 100, 200, 500 training
#null = 0, 1, 2, 5 training & test
nobs = c(100, 200, 500)
null = c(0, 1, 2, 5)
for (n in nobs){
for (v in null){
train = nulldata(v, n)
test = nulldata(v, 10000)
neurn(train, test)
}
}
terrdf = numeric(6)
trerrdf = numeric(6)
for (i in 1:7){ #repeat neuralnet 7 times
nobs = c(100, 200, 500)
null = c(0, 1, 2, 5)
for (n in nobs){
for (v in null){
train = nulldata(v, n)
test = nulldata(v, 10000)
nn = neurn(train, test, plot = F)
terrdf = data.frame(terrdf, nn$testerr) #dataframe of all error results
trerrdf = data.frame(trerrdf, nn$trainerr)
}
}
}
terrdf = terrdf[,-1]
trerrdf = trerrdf[,-1]
#List of plot titles
title = character()
nobs = c(100, 200, 500)
null = c(0, 1, 2, 5)
for (n in nobs){
for (v in null){
title = append(title, (paste("nObs:", n, " null:", v)))
}}
#separate error by nobs/null condition & plot
#columns of terrdf/trerrdf cycle through the 12 (nObs, null) combinations once per repeat,
#so the columns for combination i are i, i + 12, ..., i + 72
for (i in 1:12){
t = terrdf[, seq(i, by = 12, length.out = 7)]
boxplot(t(t), main = paste("Test Error", title[i]))
tr = trerrdf[, seq(i, by = 12, length.out = 7)]
boxplot(t(tr), main = paste("Training Error", title[i]), col = 'pink')
}
Error decreases with an increasing number of observations. Increasing the number of null variables, however, can increase the error and produce much more jagged error curves across node counts. In particular, 5 null variables shows worsening error rates.
Use neuralnet to model the outcome in WiFi localization
dataset that we used in previous weeks using outcome in its
original, four-levels format. Your neuralnet models
will need to have four output nodes to represent outcome with four
levels. Obtain training and test error for this model for several sizes
of the hidden layer. Compare resulting test error in predicting Location
3 to that observed for random forest, SVM and KNN approaches earlier in
the course.
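The four-output-node setup can be sketched as follows. This is a minimal sketch, not a complete solution: the file name "wifi_localization.txt" and the column layout (seven signal-strength covariates plus a four-level location label) are assumptions about the WiFi localization dataset used earlier in the course, and the one-hot encoding of the outcome is one common way to obtain four output nodes with neuralnet.

```r
library(neuralnet)
# ASSUMED file name and columns: seven WAP signal strengths and a Location label (1-4)
wifi = read.table("wifi_localization.txt",
col.names = c(paste0("WAP", 1:7), "Loc"))
# one-hot encode the four-level outcome into four indicator columns (one per output node)
for (l in 1:4) wifi[[paste0("Loc", l)]] = as.numeric(wifi$Loc == l)
# multi-response formula: four outputs on the left, seven inputs on the right
fml = as.formula(paste(paste(paste0("Loc", 1:4), collapse = " + "), "~",
paste(paste0("WAP", 1:7), collapse = " + ")))
nn = neuralnet(fml, data = wifi, hidden = 5, linear.output = FALSE)
# predicted class = output node with the largest activation
pred = max.col(predict(nn, wifi))
mean(pred != wifi$Loc) # training misclassification error
```

From here, the training/test split, the loop over hidden-layer sizes, and the per-location comparison against the earlier random forest, SVM, and KNN results would follow the same pattern as the simulated-data code above.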
For reproducibility purposes it is always a good idea to capture the state of the environment that was used to generate the results:
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.2.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] caret_6.0-94 lattice_0.20-45 dplyr_1.1.0
## [4] scatterplot3d_0.3-43 ggplot2_3.4.2 neuralnet_1.44.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.10 lubridate_1.9.2 listenv_0.9.0
## [4] class_7.3-20 digest_0.6.31 ipred_0.9-14
## [7] foreach_1.5.2 utf8_1.2.3 parallelly_1.35.0
## [10] R6_2.5.1 plyr_1.8.8 stats4_4.2.2
## [13] hardhat_1.3.0 e1071_1.7-13 evaluate_0.20
## [16] highr_0.10 pillar_1.8.1 rlang_1.1.0
## [19] data.table_1.14.8 jquerylib_0.1.4 rpart_4.1.19
## [22] Matrix_1.5-1 rmarkdown_2.20 splines_4.2.2
## [25] gower_1.0.1 stringr_1.5.0 munsell_0.5.0
## [28] proxy_0.4-27 compiler_4.2.2 xfun_0.37
## [31] pkgconfig_2.0.3 globals_0.16.2 htmltools_0.5.4
## [34] nnet_7.3-18 tidyselect_1.2.0 tibble_3.2.1
## [37] prodlim_2023.03.31 codetools_0.2-18 fansi_1.0.4
## [40] future_1.32.0 withr_2.5.0 ModelMetrics_1.2.2.2
## [43] MASS_7.3-58.3 recipes_1.0.5 nlme_3.1-160
## [46] jsonlite_1.8.4 gtable_0.3.1 lifecycle_1.0.3
## [49] magrittr_2.0.3 pROC_1.18.0 scales_1.2.1
## [52] stringi_1.7.12 future.apply_1.10.0 cli_3.6.0
## [55] cachem_1.0.6 reshape2_1.4.4 timeDate_4022.108
## [58] bslib_0.4.2 generics_0.1.3 vctrs_0.6.1
## [61] lava_1.7.2.1 iterators_1.0.14 tools_4.2.2
## [64] glue_1.6.2 purrr_1.0.1 parallel_4.2.2
## [67] fastmap_1.1.0 survival_3.4-0 yaml_2.3.7
## [70] timechange_0.2.0 colorspace_2.1-0 knitr_1.42
## [73] sass_0.4.5
The time it took to knit this file from beginning to end is about (seconds):
proc.time() - ptStart
## user system elapsed
## 47.193 2.562 50.470